Complete anchors/headings with data from Webref #2426

Merged
merged 2 commits into from
Dec 19, 2022

Conversation

tidoust
Contributor

@tidoust tidoust commented Dec 15, 2022

This proposes to complete the update mechanism that produces the cross-references anchors and headings to also use data from Webref, as proposed in #1761. To keep changes minimal and avoid introducing possibly conflicting anchors and headings here and there (and thus avoid breaking existing specs), data from Webref is only used for specs that are not in Shepherd's database.
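
For illustration, the selection rule described above (Shepherd wins, Webref only fills in specs that Shepherd does not know about) can be sketched as follows. This is a minimal sketch, not the code in this PR: the function name `merge_sources` and the data shape (a dict keyed by spec shortname) are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical shape: spec shortname -> list of anchor (or heading) records.
SpecData = Dict[str, List[dict]]


def merge_sources(shepherd: SpecData, webref: SpecData) -> SpecData:
    """Return Shepherd data, completed with Webref data for missing specs only."""
    merged = dict(shepherd)  # shallow copy so the inputs are left untouched
    for spec, records in webref.items():
        # Webref data is ignored whenever Shepherd already covers the spec,
        # so a given spec never gets records from both sources.
        if spec not in merged:
            merged[spec] = records
    return merged
```

With this rule, a spec present in both sources keeps its Shepherd records unchanged, and a spec present only in Webref is added as-is.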

In practice, this adds definitions and headings from 267 specifications (~60 of which only have headings), generating ~3.5MB of anchors data and ~5.2MB of headings data. Full list below. Only 3 CSS-related specs are not (yet) in Shepherd: CSS Anchor Positioning, CSS Images Module Level 5, and CSS Parser API.

Produced anchors and headings look good to me but I don't really know how to validate them in practice.

There won't be duplicates in the sense that anchors/headings data will either come from Shepherd or from Webref but not from both. Adding new anchors data means that there may be more situations where a term is defined in more than one specification though. Existing specifications should be able to continue linking to terms they already use but some may need to add a few additional linking defaults. If that seems useful, it should be relatively easy to build a list of terms defined in more than one spec to better understand what specifications are going to be potentially affected.
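
As a rough idea of what building that list could look like, here is a minimal sketch. The record fields (`text`, `type`, `spec`) and the function name `terms_in_multiple_specs` are assumptions for the example, not the actual format of the anchors data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def terms_in_multiple_specs(anchors: Iterable[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Map each (term, type) pair to the specs defining it, keeping only duplicates."""
    specs_by_term = defaultdict(set)
    for anchor in anchors:
        # Hypothetical fields: the linking text, the definition type, the spec shortname.
        specs_by_term[(anchor["text"], anchor["type"])].add(anchor["spec"])
    return {
        term: sorted(specs)
        for term, specs in specs_by_term.items()
        if len(specs) > 1
    }
```

Grouping by both term and type avoids flagging, say, an enum and a prose definition that merely share a name.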

(Side note: I'm not fluent in Python, so the code may need some rewriting. I tried to be explicit about types, for instance, but I'm not sure I got that part right, or when these types are actually checked.)

List of Webref specs added to cross-ref database

@tidoust tidoust changed the title from "Webref anchors" to "Complete anchors/headings with data from Webref" on Dec 15, 2022
@tabatkins
Collaborator

Woah, this is super cool! I was in the middle of doing exactly this yesterday, but stopped for the day before I was finished, and now you've beaten me to it.

I tested specs generated with the current db against this db, and mostly it looks fine. There are a lot more "ambiguous for" errors now, which cause things to stop linking, but that's the sort of problem you run into in general; it's unavoidable.

However, there are a handful of linking changes I think are legitimately wrong:

  • the PermissionState enum is defined in permissions, but it's also repeated in deviceorientation, and the way I resolve conflicts (at random, basically) means the latter is getting chosen. We should fix deviceorientation, and/or manually block it from defining those terms (the enum and its values) in reffy.
  • i18n-glossary conflicts with whatwg's Infra for a number of string-related terms: case-sensitive, code unit, surrogate, etc. The glossary is definitely much more informational rather than normative, though. I'm not sure we want that document to be contributing definitions?

@tidoust
Contributor Author

tidoust commented Dec 16, 2022

  • the PermissionState enum is defined in permissions, but it's also repeated in deviceorientation, and the way I resolve conflicts (at random, basically) means the latter is getting chosen. We should fix deviceorientation, and/or manually block it from defining those terms (the enum and its values) in reffy.

We have a patching process in place for IDL terms to create curated IDL extracts and, indeed, one of them drops PermissionState from deviceorientation. Relevant pending pull request on the spec is at w3c/deviceorientation#88.

Problem here is that we don't have a similar curation mechanism in place for definitions (this is being tracked in w3c/webref#789). Maintaining patches consumes time so we tend to resist the temptation to create more places where they could appear ;) I'll try to find a workaround in Reffy!

  • i18n-glossary conflicts with whatwg's Infra for a number of string-related terms: case-sensitive, code unit, surrogate, etc. The glossary is definitely much more informational rather than normative, though. I'm not sure we want that document to be contributing definitions?

Argh, adding the glossary was requested by the i18n group precisely for cross-referencing purpose, see w3c/webref#465.

Instead of picking up a term at random, would it be possible for Bikeshed to prefer normative definitions over informative ones, and exported ones over non-exported ones? In that particular case, that would mean the definitions in Infra would always be chosen over the ones in i18n-glossary.

@tidoust
Contributor Author

tidoust commented Dec 19, 2022

Problem here is that we don't have a similar curation mechanism in place for definitions (this is being tracked in w3c/webref#789). Maintaining patches consumes time so we tend to resist the temptation to create more places where they could appear ;) I'll try to find a workaround in Reffy!

There is now a crude mechanism in place to patch definitions that need patching. As a result, Webref definitions data no longer contains duplicate PermissionState definitions. The goal is to use the mechanism only as a last resort, though, and only when the duplicates are clearly an error (so not for definitions in the Internationalization Glossary, where duplication seems intended by the spec authors, even though that's not ideal). Current duplicates in Webref's data are listed in w3c/webref#306 (comment); the list only contains duplicates that match selection rules.

@tabatkins
Collaborator

Excellent!

Re: the infra/i18n conflict, I already prefer exported dfns over unexported, but don't track whether a definition is normative or informative. (The concept of an informative definition is somewhat contradictory!) However, I do have a mechanism for preferring dfns from one spec over another when they're both possibilities, either at the individual dfn level or at the full spec level. I'll go ahead and deploy that to prefer Infra over i18n-glossary for the conflicting terms.
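
To make the resolution order concrete, here is a minimal sketch of the idea: exported dfns win over unexported ones, then a spec-level preference breaks remaining ties. It is not Bikeshed's actual implementation; `pick_definition`, the `preferred_specs` parameter, and the record fields (`spec`, `export`) are hypothetical names chosen for the example.

```python
from typing import Iterable, List, Optional


def pick_definition(candidates: List[dict],
                    preferred_specs: Iterable[str] = ()) -> Optional[dict]:
    """Pick one definition among conflicting candidates, or None if still ambiguous."""
    # Step 1: prefer exported definitions when at least one candidate is exported.
    exported = [c for c in candidates if c.get("export")]
    pool = exported or candidates
    # Step 2: apply spec-level preferences, in the order given.
    for spec in preferred_specs:
        for candidate in pool:
            if candidate["spec"] == spec:
                return candidate
    # Step 3: a single remaining candidate is unambiguous; otherwise give up.
    return pool[0] if len(pool) == 1 else None
```

Under these assumptions, passing `preferred_specs=["infra"]` selects the Infra definition of a term such as "code unit" over the one from i18n-glossary, while the same call without a preference reports the pair as ambiguous by returning `None`.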

tabatkins added a commit that referenced this pull request Dec 19, 2022
@tabatkins tabatkins merged commit eacff10 into speced:main Dec 19, 2022
tabatkins added a commit that referenced this pull request Dec 20, 2022